\(
\newcommand{\water}{{\rm H_{2}O}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\Z}{\mathbb{Z}}
\newcommand{\Q}{\mathbb{Q}}
\newcommand{\E}{\mathbb{E}}
\newcommand{\d}{\mathop{}\!\mathrm{d}}
\newcommand{\grad}{\nabla}
\newcommand{\T}{^\text{T}}
\newcommand{\mathbbone}{\unicode{x1D7D9}}
\renewcommand{\:}{\enspace}
\DeclareMathOperator*{\argmax}{arg\,max}
\DeclareMathOperator*{\argmin}{arg\,min}
\DeclareMathOperator{\Tr}{Tr}
\newcommand{\norm}[1]{\lVert #1\rVert}
\newcommand{\KL}[2]{ \text{KL}\left(\left.\rule{0pt}{10pt} #1 \; \right\| \; #2 \right) }
\newcommand{\slashfrac}[2]{\left.#1\middle/#2\right.}
\)
Often, we encounter objective functions that are expectations (for instance, in VAEs [1]):
\[
\mathcal{L}(\theta, \phi) = \E_{q_\phi(z)}\big[ \, f_\theta(z) \, \big]
\]
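For concreteness, here is a minimal sketch of estimating such an objective by Monte Carlo. The setup is purely hypothetical, chosen only for illustration: \(\; q_\phi(z) = \mathcal{N}(\mu, \sigma^2) \;\) with \(\; \phi = (\mu, \sigma) \;\), and \(\; f_\theta(z) = \theta z^2 \;\).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z, theta):
    # Hypothetical integrand f_theta(z); any function of z and theta would do.
    return theta * z ** 2

def monte_carlo_objective(theta, mu, sigma, num_samples=10_000):
    # L(theta, phi) = E_{q_phi(z)}[f_theta(z)] with q_phi(z) = N(mu, sigma^2),
    # estimated by averaging f over samples drawn from q_phi.
    z = rng.normal(mu, sigma, size=num_samples)
    return f(z, theta).mean()

# E[theta * z^2] = theta * (mu^2 + sigma^2), so this should print roughly 2.0.
print(monte_carlo_objective(theta=2.0, mu=0.0, sigma=1.0))
```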
In order to minimize the objective function, we need to take gradients of this expectation with respect to both sets of parameters, \(\; \theta \;\) and \(\; \phi \;\).
-
It is easy to take the gradient with respect to the function parameters \(\; \theta \;\): the distribution we take the expectation over, \(\; q_\phi(z) \;\), does not depend on them, so we can move the gradient inside the expectation:
\[
\nabla_\theta \, \mathcal{L}(\theta, \phi) \; = \; \nabla_\theta \, \E_{q_\phi(z)}\big[ \, f_\theta(z) \, \big] \; = \; \E_{q_\phi(z)}\big[ \, \nabla_\theta \, f_\theta(z) \, \big]
\]
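Continuing the toy example above (again, with the hypothetical choices \(\; q_\phi = \mathcal{N}(\mu, \sigma^2) \;\) and \(\; f_\theta(z) = \theta z^2 \;\)), an unbiased estimate of \(\; \nabla_\theta \, \mathcal{L} \;\) is the sample average of \(\; \nabla_\theta \, f_\theta(z) = z^2 \;\):

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_theta_f(z, theta):
    # For the toy choice f_theta(z) = theta * z^2, the gradient w.r.t. theta is z^2.
    return z ** 2

def monte_carlo_grad_theta(theta, mu, sigma, num_samples=10_000):
    # grad_theta L = E_{q_phi(z)}[grad_theta f_theta(z)]: sample z ~ q_phi and average.
    z = rng.normal(mu, sigma, size=num_samples)
    return grad_theta_f(z, theta).mean()

# True value: d/dtheta [theta * (mu^2 + sigma^2)] = mu^2 + sigma^2 = 1.0 here.
print(monte_carlo_grad_theta(theta=2.0, mu=0.0, sigma=1.0))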
-
However, taking the gradient with respect to the distributional parameters \(\; \phi \;\) is more difficult: the distribution \(\; q_\phi(z) \;\) we are averaging over depends on them, so we cannot simply move the gradient inside the expectation:
\[
\nabla_\phi \, \mathcal{L}(\theta, \phi) \; = \; \nabla_\phi \, \E_{q_\phi(z)}\big[ \, f_\theta(z) \, \big]
\]
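To see why naively pushing the gradient inside would go wrong here, take the same hypothetical toy setup (\(\; q_\phi = \mathcal{N}(\mu, \sigma^2) \;\), \(\; f_\theta(z) = \theta z^2 \;\)): \(\; f \;\) itself does not depend on \(\; \mu \;\), so \(\; \E_{q_\phi}\big[ \nabla_\mu f_\theta(z) \big] = 0 \;\), while the true gradient \(\; \nabla_\mu \, \theta(\mu^2 + \sigma^2) = 2\theta\mu \;\) is nonzero:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, mu, sigma = 2.0, 0.5, 1.0

# Wrong: pushing the gradient inside gives E_q[d f / d mu] = 0, because
# f_theta(z) = theta * z^2 does not depend on mu -- only the sampling distribution does.
z = rng.normal(mu, sigma, size=100_000)
naive_grad_mu = np.zeros_like(z).mean()

# Right: E_{N(mu, sigma^2)}[theta * z^2] = theta * (mu^2 + sigma^2),
# so the true gradient w.r.t. mu is 2 * theta * mu.
true_grad_mu = 2 * theta * mu

print(naive_grad_mu, true_grad_mu)  # 0.0 vs 2.0
```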
The strategy for computing the gradient w.r.t. \(\; \phi \;\) is to convert this ugly expression into an expectation (so that we can estimate it by drawing several samples and averaging, i.e. with a Monte Carlo approximation). Two possible ways to do this are:
[1] Kingma and Welling, 2014. Auto-Encoding Variational Bayes.